[HUDI-3215] Solve UT for Spark 3.2 #4565
@YannByron can you please add a PR description, so that it's a little clearer what we're trying to tackle here?
Sure. When upgrading to Spark 3.2, the
06004c8 to 1d110f6
Some(HoodiePayloadConfig.newBuilder.withPayloadOrderingField(preCombineField.get).build.getProps)
val properties = HoodiePayloadConfig.newBuilder
  .withPayloadOrderingField(preCombineField.get)
  .withPayloadEventTimeField(preCombineField.get)
Is this a bug if it is not set?
I think so.
If it is not set, an upsert may produce a wrong result when DefaultHoodieRecordPayload is used. And in Spark 3.2, if no ordering.field or event.time.field is found, setting returnNullIfNotFound to true does not take effect when calling HoodieAvroUtils.getNestedFieldVal, as in #4169.
setting returnNullIfNotFound to true does not take effect when call HoodieAvroUtils.getNestedFieldVal
this looks like a bug in getNestedFieldVal() ? is it fixable in spark 3.2?
I think not. I also don't know what causes the exception when returnNullIfNotFound is set to true and getNestedFieldVal is called. But if PAYLOAD_ORDERING_FIELD_PROP_KEY is set automatically whenever preCombineField is set, which we should have been doing anyway, it works well.
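For readers following the returnNullIfNotFound discussion, here is a minimal sketch of the behavior the flag name suggests: a missing field yields an empty result when the flag is true and an exception when it is false. This is not Hudi's HoodieAvroUtils.getNestedFieldVal; the object name, the lookupNested helper, and the use of nested Scala maps in place of Avro records are all assumptions made for illustration.

```scala
// Hypothetical stand-in for a nested field lookup. NOT Hudi's implementation:
// nested Map[String, Any] values play the role of Avro records here.
object NestedLookupSketch {

  /** Resolves a dot-separated path such as "meta.updated_at".
    * If a segment is missing: returns None when returnNullIfNotFound is true,
    * throws NoSuchElementException otherwise. */
  def lookupNested(record: Map[String, Any], path: String, returnNullIfNotFound: Boolean): Option[Any] = {
    def notFound(segment: String): Option[Any] =
      if (returnNullIfNotFound) None
      else throw new NoSuchElementException(s"Field '$segment' not found while resolving '$path'")

    @scala.annotation.tailrec
    def loop(current: Any, remaining: List[String]): Option[Any] = remaining match {
      case Nil => Option(current)
      case head :: tail =>
        current match {
          case m: Map[_, _] =>
            m.asInstanceOf[Map[String, Any]].get(head) match {
              case Some(next) => loop(next, tail)
              case None       => notFound(head)
            }
          case _ => notFound(head) // tried to descend into a non-record value
        }
    }

    loop(record, path.split('.').toList)
  }
}
```

With a contract like this, a caller that passes returnNullIfNotFound = true never sees an exception for an absent ordering or event time field, which is the behavior the comments above expected.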
private def extractRequiredSchema(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
  val tableAvroSchema = new Schema.Parser().parse(tableState.tableAvroSchema)
  val requiredAvroSchema = new Schema.Parser().parse(tableState.requiredAvroSchema)
this is not used?
Yes, I should remove this.
  rows
}

private def extractRequiredSchema(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
the name could imply returning schema
private def extractRequiredSchema(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
private def toRowsWithRequiredSchema(iter: Iterator[InternalRow]): Iterator[InternalRow] = {
// use preCombineField to fill in PAYLOAD_ORDERING_FIELD_PROP_KEY and PAYLOAD_EVENT_TIME_FIELD_PROP_KEY
if (mergedParams.contains(PRECOMBINE_FIELD.key())) {
  mergedParams.put(HoodiePayloadProps.PAYLOAD_ORDERING_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
  mergedParams.put(HoodiePayloadProps.PAYLOAD_EVENT_TIME_FIELD_PROP_KEY, mergedParams(PRECOMBINE_FIELD.key()))
}
the expected scenario should be: when preCombineField is set, PAYLOAD_ORDERING_FIELD_PROP_KEY is set automatically, and PAYLOAD_EVENT_TIME_FIELD_PROP_KEY is never required. It'll be good to fix these along the way.
But I see that it didn't do that, and I don't know how it could have run correctly before.
By the way, why is PAYLOAD_EVENT_TIME_FIELD_PROP_KEY never required? After all, DefaultHoodieRecordPayload uses it for now.
xushiyan left a comment
@YannByron let me clarify my thoughts:
- The event time field can be different from the ordering / precombine field: the former is used to calculate latency/freshness; it can be another field, say occurred_at, and the latency will be calculated as commit_time - occurred_at, while the records are ordered or precombined using updated_at.
- There are users who need to set event time and there are those who don't care. For the latter, we can't just set it to the ordering field; we should leave it unset, and then nothing will be calculated or persisted in commit metadata.
- As you reported, "setting returnNullIfNotFound to true does not take effect"; we should fix that instead of masking it by setting event time.
WDYT?
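The distinction drawn in the first two bullets above can be sketched in a few lines. This is an illustrative model only, not Hudi code: the Rec case class, its field names, and the precombine and latencyMs helpers are hypothetical. Records are deduplicated by the ordering field (updated_at), while latency is derived from a separate, optional event time field (occurred_at).

```scala
// Illustrative model only; none of these names come from Hudi.
object EventTimeVsOrderingSketch {

  // occurredAt is optional: users who don't care about event time leave it unset.
  final case class Rec(key: String, updatedAt: Long, occurredAt: Option[Long], value: String)

  /** Precombine: for each key, keep the record with the largest ordering value. */
  def precombine(records: Seq[Rec]): Map[String, Rec] =
    records.groupBy(_.key).map { case (k, rs) => k -> rs.maxBy(_.updatedAt) }

  /** Latency as commit_time - occurred_at; None when no event time is available,
    * in which case nothing would be calculated or persisted. */
  def latencyMs(commitTime: Long, rec: Rec): Option[Long] =
    rec.occurredAt.map(commitTime - _)
}
```

Setting the event time to the ordering field would make latencyMs return commit_time - updated_at for every user, including those who never asked for latency tracking, which is the masking effect the comment warns against.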
@xushiyan thank you very much for the explanation. I'll restore the original logic.
@xushiyan what's the difference between
1d110f6 to f619879
@xushiyan I removed the code that set event.time from precombine.field, but I kept the code that sets ordering.field from precombine.field. Is that right?
@YannByron yes we should tackle
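The outcome discussed above (derive the ordering field from the precombine field, leave event time untouched) can be sketched over a plain immutable map. This is not the PR's actual code: the object and method names are hypothetical, and the three key strings are assumptions standing in for PRECOMBINE_FIELD.key(), PAYLOAD_ORDERING_FIELD_PROP_KEY and PAYLOAD_EVENT_TIME_FIELD_PROP_KEY.

```scala
// Sketch under stated assumptions; the key strings below are assumed values,
// not taken from this conversation.
object OrderingFieldDefaultsSketch {
  val PrecombineKey = "hoodie.datasource.write.precombine.field" // assumed
  val OrderingKey   = "hoodie.payload.ordering.field"            // assumed
  val EventTimeKey  = "hoodie.payload.event.time.field"          // assumed

  /** When a precombine field is present, copy it to the ordering field.
    * The event time key is never added or modified here. */
  def withOrderingFromPrecombine(params: Map[String, String]): Map[String, String] =
    params.get(PrecombineKey) match {
      case Some(field) => params + (OrderingKey -> field)
      case None        => params
    }
}
```

Like the snippet under review, this overwrites any ordering field that was set explicitly; whether an explicit value should win instead is a design choice this sketch does not settle.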